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Forward genetic screens in model organisms are vital for identifying novel genes essential for developmental or disease 
processes. One drawback of these screens is the labor-intensive and sometimes inconclusive process of mapping the 
causative mutation. To leverage high-throughput techniques to improve this mapping process, we have developed a 
Mutation Mapping Analysis Pipeline for Pooled RNA-seq [MMAPPR) that works without parental strain information or 
requiring a preexisting SNP map of the organism, and adapts to differential recombination frequencies across the genome. 
MMAPPR accommodates the considerable amount of noise in RNA-seq data sets, calculates allelic frequency by Euclidean 
distance followed by Loess regression analysis, identifies the region where the mutation lies, and generates a list of putative 
coding region mutations in the linked genomic segment. MMAPPR can exploit RNA-seq data sets from isolated tissues or 
whole organisms that are used for gene expression and transcriptome analysis in novel mutants. We tested MMAPPR on 
two known mutant lines in zebrafish, nkx2.5and tbxtj and used it to map two novel ENU-induced cardiovascular mutants, 
with mutations found in the ctr9 and cds2 genes. MMAPPR can be directly applied to other model organisms, such as 
Drosophila and Caenorhabditis elegans, that are amenable to both forward genetic screens and pooled RNA-seq experiments. 
Thus, MMAPPR is a rapid, cost-efficient, and highly automated pipeline, available to perform mutant mapping in any 
organism with a well-assembled genome. 



[Supplemental material is available for this article.] 

Forward genetic screens in zebrafish have identified a large 
number of genes essential for organogenesis (Driever et al. 1996; 
Haffter et al. 1996), laterality (Chen et al. 2001), axon guidance 
(Xiao et al. 2005), and cancer development (Moore et al. 2006), 
many of which have been linked to human disease. Similar 
large-scale forward-genetic screens have been and continue to 
be performed in mice (Yu et al. 2004; Garcia-Garcia et al. 2005), 
Drosophila (Nusslein-Volhard and Wieschaus 1980; Medina et al. 
2006), and Caenorhabditis elegans (Brenner 1974; Hughes et al. 

2011) . However, mapping mutants in many species traditionally 
has been labor intensive and often inconclusive, especially in or- 
ganisms with relatively complex genomes. 

Several methods exist to expedite genetic mapping. For 
example, genotyping DNA pooled from phenotype-sorted in- 
dividuals (bulk segregant analysis) has long been a standard 
method for low-resolution genetic mapping. Bulk segregant 
analysis provides a qualitative estimate of the linkage between 
a given marker and the mutant locus, while greatly reducing the 
time and expense of genotyping. However, this method is still 
labor intensive because it requires that each marker be analyzed 
individually. 

The development of techniques using genotyping arrays 
(Tabernero et al. 2012), genomic resequencing of individuals 
(Warren et al. 2012), and exome-capture sequencing (Lin et al. 

2012) have made mapping mutations much more rapid in hu- 
man populations by allowing multiple markers to be analyzed 
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simultaneously, but they have been less widely adopted in many 
model organisms because of incomplete genomic annotation, high 
polymorphism rates, and the costs associated with performing these 
analyses on large numbers of individuals. Recently, several methods 
to use whole-genome sequencing techniques to model organisms 
have been proposed for Arahidopsis thaliana (Schneeberger et al. 
2009; Cuperus et al. 2010; Austin et al. 2011; Uchida et al. 2011), 
zebrafish (Bowen et al. 2012; Leshchiner et al. 2012; Voz et al. 2012), 
mice (Arnold et al. 2011), and C. elegans (Doitsidou et al. 2010; 
Zuryn et al. 2010). 

An alternative to whole-genome sequencing (WGS) is RNA-seq, 
which is less expensive because the transcriptome is smaller than 
the genome, allowing greater read depth to be achieved with fewer 
reads. The utility of RNA-seq analysis for mapping has been dem- 
onstrated recently in self -pollinated individuals derived from in- 
bred mapping strains in maize (Liu et al. 2012), but this has not 
been tested in more noisy data sets from outbred animal pop- 
ulations. In addition to mapping, RNA-seq is becoming a standard 
analysis method in model organisms for determining the gene 
expression and splicing changes underlying phenotypes derived 
from both forward and reverse genetics (Aanes et al. 2011; Rosel 
et al. 2011; Vesterlund et al. 2011). 

Because RNA-seq of pooled individuals can be used for dif- 
ferential expression analysis to further understand the phenotypes 
of novel mutants from forward genetic screens, we sought to de- 
velop a method to use these data to identify the causative mutation 
underlying the observed phenotype, thus creating an inexpensive 
and rapid alternative to traditional mapping procedures. We have 
designed our method, which we call MMAPPR (Mutant Mapping 
Analysis Pipeline for Pooled RNA-seq), to use the data and exper- 
imental design typical in RNA-seq-based transcriptome experi- 
ments directly. Although this study goes through the principles 
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and optimization of MMAPPR analysis as well as the details of 
successfully mapping four mutants, the average user will not be 
required to have this level of expertise and can simply process 
their data sets through our program (available at http://yost. 
genetics.utah.edu/software.php) in order to identify their mu- 
tant genes. We have validated MMAPPR on two known mutants, 
nkx2.5 (KVTargoff, unpubl.) and tbxl (Piotrowski et al. 2003), and 
two unknown mutant lines, zyl3 and zyl4, identified in an ENU 
screen performed in our laboratory. MMAPPR was then used to 
identify a genomic region containing the mutation and generate 
a list of nonsynonymous mutations that serve as candidates for 
the gene encoding the causative mutation. In each case, the 
identified causative mutation was <1 cM from the maximum 
score generated by MMAPPR, indicating that MMAPPR is able to 
identify mutations derived from a forward genetic screen in 
zebrafish successfully and accurately. In addition to zebrafish, 
MMAPPR can be directly applied to other organisms, such as 
Drosophila melanogaster and C. elegans, in which both forward 
genetic screens and pooled RNA-seq experiments are common, 
thus removing a significant barrier for performing mutagenesis 
screens in model organisms. 



Results 

We developed a novel method, MMAPPR, for identifying recessive 
mutations identified in forward genetic screens (outlined in Fig. 1). 
Briefly, MMAPPR uses RNA-seq data from F2 embryos separated by 
phenotype into two pools: wild- type phenotype (which includes 
homozygous wild- type and heterozygotes), and mutant pheno- 
type (which includes homozygous mutants). Candidate molecular 
mutations are then identified based on three criteria: physical lo- 
cation in the linked region, expression at the time of tissue col- 
lection, and effect on protein amino acid sequence. 

RNA-seq data contain many thousands of single nucleotide 
polymorphic markers (SNPs) spread across the entire genome, 
making it an ideal source for high-throughput mapping. However, 
these data sets are extremely noisy due to the variable expression 
levels of individual genes across the genome at the time of RNA 
collection, PCR amplification artifacts, sequencing errors, map- 
ping errors, and genome annotation errors. MMAPPR compen- 
sates for the noise inherent in RNA-seq data sets to map muta- 
tions. MMAPPR encompasses five steps: (1) RNA sequencing and 
mapping, (2) allele frequency distance calculation, (3) signal 
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Figure 1. Schematic overview of MMAPPR. (A) Mating scheme. Fish from the mutant line are outcrossed and then F1 progeny, heterozygous for the 
mutation, are crossed. Pools of phenotypically mutant and phenotypically wild-type fish are then sorted. (B) Schematic representation of allelic segre- 
gation between wild-type and mutant pools. (Black) Genomic regions inherited from the mutant carrier line; (gray) regions inherited from the outcross 
line. (Bottom panel) A plot of the expected Euclidean distances calculated from the allele frequencies in the two pools. (C) Flowchart of the analysis steps 
incorporated in the MMAPPR algorithm. Each phase of the pipeline is shown by gray boxes, and the portion of the pipeline processed by the MMAPPR 
software package is shown by a vertical gray bar. 
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processing, (4) candidate single nucleotide polymorphism (SNP) 
identification, and (5) candidate confirmation. We have created 
the MMAPPR software package to perform steps 2-4, while steps 1 
and 5 involve bench work and preexisting software packages; 
however, because all five steps are integral to the process, each one 
is covered in more detail below. 

RNA sequencing and mapping 

Any pool of individuals derived from a cross between two hetero- 
zygous carriers of a SNP will contain a Mendelian distribution of 
genotypes (expected frequencies: 0.25 AA, 0.5 Aa, and 0.25 aa) at 
every SNP locus where both parents were heterozygous. Similarly, 
the expected allele frequencies of such a SNP are f A = 0.5 and f & = 
0.5. However, when the pool of individuals is subdivided into two 
pools based on a mutant phenotype, the expected allele frequen- 
cies for any SNP depend on its linkage to the mutation causing the 
phenotype. For example, a SNP located 10 cM away from the 
causative mutation has an expected allele frequency of 90% for the 
allele linked to the causative mutation and 10% for the comple- 
mentary allele. 

MMAPPR uses this principle by selecting polymorphic SNPs 
from mapped RNA-seq reads and calculating the SNP allele fre- 
quencies in each pool, and then uses this information to estimate 
the location of the causative mutation (described below). There- 
fore, the theoretical resolution of these SNPs is a function of both 
population size and read depth. It is expected that increasing these 
factors would improve results until other factors such as data set 
noise and SNP density in the genome become limiting factors. To 
test the effect of population size, we crossed zyl4 heterozygotes 
and prepared RNA-seq libraries from 20 phenotypically mutant 
and 20 phenotypically wild- type sibling embryos. We did not ad- 
just this parameter to fewer than 20 individuals because of the 
amount of material typically required to build libraries for RNA- 
seq. This was performed three times, generating six libraries of 
three separate biological replicates for the zyl4 line. We then used 
MMAPPR to analyze the 20-embryo RNA-seq data sets pairwise 
(matched sets of mutant and wild- type pools), and then combined 
the RNA-seq data sets from the three independent biological re- 
peats to create a single data set representing 60 embryos. Surpris- 
ingly, increasing the number of individuals from 20 to 60 resulted 
in only a marginal decrease in the width of the detected peak (from 
7.1 Mb to 5.8 Mb for the zyl4 data set). It is important to note that 
improvements in peak size and shape are limited by data noise and 
SNP density. Therefore, the observations that the peak was only 
marginally improved by increasing numbers of embryos in the 
pools and that the peak maximum is <1 cM from the causative 
mutation indicate that we may have reached these limits, thus 
masking any improved resolution gained by increasing the num- 
ber of individuals in each pool. However, the increased number of 
individuals did increase the height of the peak from an average of 
0.73 in the individual data sets to 1.29 in the combined data set. 
Together, this suggests that RNA-seq libraries generated from 20 
individuals should be sufficient for mapping. 

An important question is how many reads are required in an 
RNA-seq data set for MMAPPR. We analyzed the effects of read 
count on peak width and height. Read depth across the genome in 
RNA-seq data sets is widely variable, with most regions having very 
low coverage (Supplemental Fig. SI). Due to the differences in gene 
expression levels at different developmental stages or in different 
tissues, SNP read coverage in any RNA-seq data is skewed and has a 
very high variance. Thus, the "mean coverage" across the genome 



(which is important for WGS analysis) is not a stable or intuitive 
summary statistic for RNA-seq-based analysis. Therefore, we used 
"total reads per pool" as a statistic that is applicable across a variety 
of RNA-seq experiments. Twenty data sets were created by ran- 
domly down-sampling reads from the combined (60 embryo) data 
set, generating data sets that contained from 1% to 100% of the 
reads from the original data. The 1% data set (containing 835,714 
and 861,895 reads in the wild-type and mutant pools, respectively) 
did not contain enough SNPs meeting thresholds to identify 
a linked region. The next larger data set (containing 5,008,193 and 
5,165,185 reads in the wild-type and mutant pools, respectively) 
generated a 10.1 -Mb peak, and all remaining, larger data sets 
generated peaks between 5.7 and 6.9 Mb wide (Fig. 2A). Peak 
maxima were also more volatile in the smaller data sets but did not 
trend toward a higher or lower score as the number of reads was 
increased (Fig. 2B). Therefore, both peak width and maximum 
score are robust at read depths greater than 10 million total reads 
per pool, which is within the normal range of RNA-seq data sets. 
However, increasing the read depth is expected to improve sensi- 
tivity for including genes in the final candidate list, which may 
significantly impact identification of low expressed genes. This 
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Figure 2. Effect of various parameters on peak width, height, and lo- 
cation. Reads were randomly sampled from the combined zy14 data set 
to create 20 data sets with 1 %-1 00% of the original data. (A) The width of 
the peak in each data set. (B) The height of the peak for the same data sets. 
The 1 % data set did not have a sufficient number of SNPs to identify a peak 
(X in A and B). (♦) All other data sets. (C) The effect of different mapping 
quality scores on peak location (distance of the peak maximum from the 
identified mutation). Mapping quality had little effect on peak position, 
but the best overall position for all data sets was at a phred-sca\ed mapping 
quality score threshold of 30. 
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information will be useful for strategizing how many barcoded 
samples can be included in a single lane of sequencing. 

Two other important considerations when designing RNA- 
seq experiments that will be used in MMAPPR are RNA collection 
time and tissue. It is important to point out that MMAPPR will 
identify the linked interval regardless of time point or tissue used 
for RNA collection. SNPs in neighboring genes that are expressed 
at the time of RNA collection should be sufficient to identify 
the linked interval, even if the mutant gene is not expressed at 
that time. Obviously, discovery of a SNP within the mutant gene 
requires that the gene be expressed to be part of the RNA-seq 
data set. We therefore suggest isolation of the affected tissue 
when feasible. However, as exemplified by mapping and iden- 
tification of previously uncharacterized zyl3 and zyl4 mutants 
from our laboratory (see below), the cell lineage that is perturbed 
in a mutant is not always known or readily isolated, so in some 
cases, RNA isolated from whole embryos can be used. RNA 
should also be collected as soon as the embryos can be reliably 
segregated by phenotype to increase the likelihood of the gene 
being expressed. 

After RNA collection and sequencing, reads are mapped to a 
reference genome. We used Novoalign (Novocraft) for read map- 
ping. Although the algorithm should be software independent, 
selection of optimized alignment parameters may differ due to 
small differences in alignment and scoring algorithms in other 
software. MMAPPR contains several signal processing steps (see 
below) to compensate for noise, but selecting the proper align- 
ment parameters also improves results. One common filtering cri- 
terion for genomic mapping is the minimum mapping quality score 
(mapq). This score is a phred-scaled probability that the read 
is misaligned and is affected by the quality of the sequencing 
read, the number of mismatches/gaps that occur at the putative 
mapping location, and the uniqueness of the sequence in the 
genome — all factors that can affect SNP calling. To identify the 
optimized mapq, we measured the distance from the peak maxi- 
mum to the known mutation in the nkx2.5 line using various mapq 
cutoffs (Fig. 2C). This analysis showed that a cutoff of 30 resulted in 
a peak maximum nearest to the known mutation. Retrospectively, 
this cutoff was also the ideal for all of the lines tested, although we 
did not optimize the other lines before analysis. Changing the score 
threshold did not have a large effect on any of the tested lines 
(largest range = 0.61 Mb), indicating that the algorithm is robust 
against the effects of low quality reads on SNP calling. 

Euclidean distance calculation 

After genomic mapping, two BAM formatted alignment files are 
submitted to the MMAPPR software package. MMAPPR then 
generates a de novo catalog of informative SNPs by identifying 
genomic positions at which there is a mixture of alleles in the 
phenotypically wild-type pool (containing both wild-type and 
heterozygous individuals). MMAPPR does not consider whether 
the base calls at these catalogued positions match the reference 
genome. This allows the algorithm to work regardless of parental 
background or the strain from which the genome build was de- 
rived. Small insertions or deletions (indels) and other genomic 
changes are not used due to the current limits of indel-calling 
algorithms. However, once these algorithms are reliable, they will 
be added to MMAPPR. The resulting data set is then processed to 
minimize the effects of noise. 

RNA-seq data sets are typically noisy due to a combination of 
several factors including library/PCR artifacts, sequencing error 



rates, mapping errors, etc. Therefore, measuring the relative allele 
frequencies between the mutant and control RNA-seq data sets 
requires a metric that is not susceptible to this noise. Traditional 
mapping methods typically use the log odds ratio (LOR) to mea- 
sure linkage disequilibrium as shown in the equation: 

LOR= Xo J a ™tl A ™A 
6 V awt/A wt J 

where a is the number of reads containing a nonreference allele at 
the position and A is the number of reads containing the reference 
allele at the position. This metric has several characteristics that 
make it well suited for mapping analyses. Among these is the fact 
that it is asymptotic with its limit corresponding with the actual 
mutation, increasing its sensitivity near the segregating locus. 
However, its asymptotic nature also makes LOR extremely sensitive 
to noise generated from inaccurate measures of allele frequency and 
to situations in which there is a zero in the denominator, both of 
which are common in relatively low-coverage RNA-seq data. The 
combination of these factors resulted in a bimodal distribution of 
LOR scores with high LOR scores representing SNPs where low 
coverage stochastically resulted in a zero in the denominator. 
These locations are spread across the entire genome, masking any 
SNPs linked to the causative mutation (Supplemental Fig. S2). 

Recent genome resequencing techniques in plants 
(Schneeberger et al. 2009) and C. elegans (Doitsidou et al. 2010) 
have used the density of SNP markers from the nonmutant line 
(based on previously generated SNP maps for parental strains) as an 
alternative metric to identify a putative linked region. However, 
RNA-seq data sets are often obtained as mutant and control pairs, 
and comparable sets are not obtained from parental lines, so this 
metric was not suited for our needs, because MMAPPR is specifi- 
cally designed to work without parental strain information. An- 
other potential method was to identify regions with homozygous 
SNPs in the mutant pool with or without incorporating these data 
into statistical models (Liu et al. 2012). However, RNA-seq is sus- 
ceptible to false positives in low-coverage regions, resulting in 
many regions throughout the genome showing a large number of 
homozygous SNPs, and this method is susceptible to imperfect 
phenotypic identification or penetrance because nonhomozygous 
candidates are automatically excluded. 

Given the reasons for rejecting the approaches outlined 
above, we chose to measure allele segregation using Euclidean 
distance (ED), as a metric that does not require parental strain in- 
formation and is resistant to noise, using the equation: 



ED - y (A mut — A wt ) + {Cmut — C wt ) + (G mu t — G w t) + (T mut — T wt ) 

where each letter (A, C, G, T) corresponds to the frequency of its 
corresponding DNA nucleotide. In practice, SNP loci with more 
than two variants are extremely rare, so two of the terms will be 
zero. Frequencies are used in place of raw read counts to com- 
pensate for read coverage differences for loci across the genome 
and between mutant and wild-type pools. Because frequencies 
cannot be accurately measured at low read counts, a minimum 
cutoff of 10 reads is used. ED is advantageous because it is linear, 
making it less prone to errors in allelic frequency analysis, and is 
able to subtract out sequence-specific errors, an artifact of Illumina 
sequencing technology (Nakamura et al. 2011) that is assumed to 
be equally present in both samples. In contrast, sequence-specific 
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artifacts can have a large effect on LOR, 
especially near a segregating mutant due 
to its asymptotic nature. 

As an example of the combined 
effects of noise on LOR and ED, we cal- 
culated the theoretically expected and 
experimentally observed LOR and ED 
scores for the nkx2.5 mutation. Expected 
scores were calculated using the read cov- 
erage numbers from the nkx2.5 RNA-seq 
data set and assuming the ideal fre- 
quencies of 100% mutant allele in the 
MUT pool and 33% mutant allele in the 
wild-type (WT) pool. The expected LOR 
for a completely linked SNP is 8.01, but 
the observed LOR score for the pre- 
viously identified nkx2.5 SNP was 1.20, 
which is only 15% of the expected value 
and within one standard deviation from 
the median. In contrast, using the same 
RNA-seq data sets, the expected ED was 
0.89 and the observed value was 0.71 — 
80% of the expected value and several 
standard deviations above the median. 
Therefore, using the ED robustly com- 
pensates for the noise found in RNA-seq 
data sets. 
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Signal processing 

Although measurement of allelic segre- 
gation between mutant and phenotypi- 
cally wild-type (WT) pools by ED was 
greatly improved and did not show the 
bimodal distribution of SNP enrichments 
seen with the LOR analysis (Supplemen- 
tal Fig. S2), it still showed considerable 
noise (Fig. 3A). Therefore, MMAPPR uses 
two signal-processing steps to identify 
the linked genomic sequence: raising the 
distance measurement to a power (ED X ) 
to decrease noise created by small varia- 
tions in the allelic frequency estimations 
(Fig. 2B) and local linear regression (Loess 
fit with a polynomial exponent of 1) 
(Cleveland 1988) of the EDs with a span 

automatically chosen by minimizing the corrected Aikaike In- 
formation Criterion (AICc) (Fig. 2C; Hurvich et al. 1998). 

Our method assumes that the Euclidean distance (ED) be- 
tween allele frequencies in the mutant and WT pools decreases 
proportionally to the genetic distance between a given SNP and the 
causative mutation (see Fig. IB). Therefore, small EDs are a mixture 
of noise and signal from relatively distant markers. As the ED in- 
creases, it becomes increasingly likely that a given distance mea- 
surement is signal and less likely that it is noise, while simulta- 
neously indicating stronger linkage. Therefore, we raise the allele 
frequency ED to a power to increase the effect of large ED mea- 
surements and decrease the effects of low ED measurements/noise 
(Fig. 3B). The effect of raising an ED to increasingly greater powers 
minimizes the effects of increasing portions of the data. Conse- 
quently, raising the distance to too large of a power results in the 
shape of the Loess curve being dominated by a few outlying points. 
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Figure 3. Data progression through noise reduction steps. (A) Raw Euclidean distance scores across 
the genome. For all panels, vertical gray lines delineate chromosome edges, and chromosome widths 
represent the relative number of SNPs on the chromosome. (B) Euclidean distance raised to the fourth 
power. (C) Loess fit curve calculated using the data shown in B. (D) Effect of raising the Euclidean 
distance to different powers on the Loess fit curve. All data are from the nkx2.5 mutant and pheno- 
typically wild-type RNA-seq data sets. (*) The mutation location. 



To determine the optimized power for our data sets, we ran 
MMAPPR on the nkx2.5 RNA-seq data at various powers and 
measured the width of the peak at its base and the proximity of the 
top of the peak to the mutation. Our results show that a power of 4 
resulted in the best fit (Fig. 3D). A power of 4 also worked well on 
the three other mutants shown here. 

A complementary method for reducing the effects of noise is 
to fit a curve to the raw data. However, there is not a readily ap- 
parent model that the ED data are expected to fit, excluding the use 
of parametric methods. We also found that smoothing methods 
that used a fixed window were sensitive to skewing in regions with 
low expressed SNP density, so we chose to fit the data using Loess 
regression. Loess regression is a nonparametric method for fitting 
curves based on a weighted average of points within a given span. 
In linkage analysis, SNPs located near each other should have the 
same ED because they are genetically linked, while SNPs located 
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progressively farther apart are progressively less linked, and thus 
the difference in their EDs is less restricted. Loess regression line- 
arly fits the data using a set number of points while weighting by 
the distance between each of the points. Points close to each other, 
which all should be genetically linked together, are weighted 
heavily while points farther apart carry less weight. Using Loess 
regression allows for and compensates for regions with low SNP 
density. In contrast, fixed window methods, which use a fixed 
number of points with equal weight, would result in windows that 
span several centimorgans in low SNP density regions, and fixed 
distance windows would have few points in low SNP density re- 
gions. Because MMAPPR is data driven, it also accounts for local 
recombination frequencies without requiring a previously gener- 
ated genetic map (Bowen et al. 2012) or requiring one to assume 
that the recombination frequency is the same across the genome, 
as required by HMM-based models (Leshchiner et al. 2012). We allow 
the span to be chosen by the software using AICc, which is commonly 
used for optimizing Loess fit curves (Hurvich et al. 1998). The result- 
ing Loess curve shows a very distinct peak (Fig. 3C) at the shared 
segment that was not readily visible in the raw data (Fig. 3 A). 

Candidate selection 

In addition to providing mapping information, RNA-seq data sets 
also contain mutation data for genes expressed at the devel- 
opmental time point when the tissue is collected. To identify 
putative causative mutations, MMAPPR selects SNPs within the 
identified peak(s) that have an ED above a threshold and have 
a high allele frequency in the mutant pool. Because of the noise 
inherent in the data set, we set these cutoffs conservatively with 
a minimum distance of 0.5 and a minimum mutant pool allele 
frequency of 0.75. Identified SNPs are then analyzed using the 
Alleler program (part of the Useq package) (Nix et al. 2008), which 
uses annotated genes to determine whether a given mutation is 
nonsynonymous. The result provides a list of putative mutations 
that are confined to the identified region and may impact the 
gene's protein product. 

The Alleler program is not able to identify causative muta- 
tions that are in genomic sequences outside the RNA data set, such 
as mutations in transcriptional regulatory regions, in genes that 
undergo nonsense-mediated decay, or mutations that consist of 
large deleted regions. Therefore, in addition to this analysis, we 
recommend that the data sets in hand be used for differential gene 
expression analysis using available software (for example, USeq, 
Cufflinks, Bioconductor) to identify genes in the identified region 
that have significantly different gene expression profiles. 

Validation 

We have verified MMAPPR using four mutants: two previously 
identified mutations, nkx2.S and tbxl tm20S , using RNA-seq from 
isolated heart tissue, and two unpublished mutations from an ENU 



screen performed in our laboratory, zyl3 and zyl4, using RNA-seq 
from whole embryos. Results from these experiments are sum- 
marized in Table 1. First, we developed our method and optimized 
the method's parameters using the previously mapped nkx2.5 line 
as shown in Figures 2B and 3. The final results for this line using the 
optimized parameters are shown in Figure 4A and Supplemental 
File 1. We next used these optimized parameters to analyze the 
previously mapped tbxl tm20S mutation (Fig. 4B; Supplemental File 
2). In both cases, MMAPPR identified a stop codon within 0.5 Mb 
of the maximum fitted peak value corresponding with the pre- 
viously identified mutations in each line. This shows that the op- 
timized MMAPPR parameters were not specific to the nkx2.5 line, 
but provided a strong starting point for novel mutation analysis. 

We next used MMAPPR on two mutants, zyl3 and zyl4, from 
an ENU screen for cardiovascular development mutants performed 
in our laboratory. The zyl3 mutant has severe pericardial edema 
and lacks melanophores and other migratory neural crest cells. 
Unlike the nkx2.S and tbxl 1 ™ 208 lines, the phenotype of this mu- 
tant did not clearly indicate a dysfunction in a readily accessible 
tissue, which will likely be a common situation in the analysis of 
novel mutants. Therefore, we asked whether MMAPPR was able to 
identify a mutation using RNA-seq from whole embryos (Fig. 5A,B) 
and compared this approach with the RAD-seq method (Baird et al. 
2008) used by the commercial mapping service Floragenex (Fig. 
5C). Interestingly, both methods identified a large region spanning 
approximately half of chromosome 7, although the peak found by 
MMAPPR was more informative because it had a maximum at —67 
Mb, while the RAD-seq region was flat, without a peak. 

To investigate possible causes for the large region on chro- 
mosome 7, we compared the published zebrafish genetic map 
(Bradley et al. 2011) with physical positions in the Zv9 genome 
assembly. In Figure 5D, lines between the two *-axes connect the 
positions of genetic mapping markers (in centimorgans) and their 
corresponding genome assembly locations (in megabases). Cross- 
ing lines indicate locations where the two maps disagree on the 
genomic order of the markers. Differences between the relative dis- 
tances between markers show localized differential recombination 
rates in the region. This analysis showed that this region falls very 
near the centromere (indicated by a black dot on the top and bottom 
*-axes), a region with very little recombination. There is also at least 
one pair of mapping markers that do not match the order in the 
physical genome assembly, indicating that there are possible enors 
in this region of the genomic build that may also affect the mapping 
results. Thus, it is likely that the large identified region was due 
to genomic characteristics that repress recombination around the 
centromere and inaccuracies in the genomic build. 

Unlike the RADseq result shown in Figure 5C (which was 
relatively flat throughout the region), MMAPPR showed a dis- 
cernible peak at —67 Mb. Candidate selection did not identify any 
stop codons in the region, but differential gene expression analysis 
of the RNA-seq data identified two genes near the peak maximum 
with greatly reduced expression levels in the mutant pool (Table 1). 



Table 1. Summary of MMAPPR results 
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PCR-based DNA sequencing confirmed the presence of a nonsense 
mutation in the ctr9 gene that appears to cause nonsense-mediated 
decay, and the ability to rescue the mutant phenotype by injection 
of wild-type ctr9 mRNA confirmed ctr9 as the causative mutation 
(MJJurynec, X Bai, A Nechiporuk, B Bisgrove, RA Somer, H Wilson, 
H Grunwald, Y-C Su, K Hoshijima, HJ Yost, et al., unpubl.). These 
results show that MMAPPR generates an accurate estimate of the 
mutant location even in centromeric regions or genomic regions 
that are not perfectly represented in current genetic maps and 
genome assemblies. 

The other mutant, zyl4, has a complex phenotype, including 
failure to form intersomitic vessels and head vasculature. MMAPPR 
results for this line are shown in Figure 6. The identified region was 
6.6 Mb, similar to the regions identified for nkx2.5 and tbxl and 
much smaller than the region identified for zyl3. Several non- 
synonymous mutations segregating with the phenotype were 
identified within this region, but only a single nonsense mutation 
was found in the cds2 gene (Table 1). Because MMAPPR works with 
RNA-seq libraries generated from 20 embryos, we analyzed these 
libraries for differential expression to see if the candidate was also 
identified. However, no genes with significantly different gene 
expression profiles were identified within the linked region using 
single sets of RNA-seq libraries. Since differential expression anal- 
ysis using a single RNA-seq replicate can be limited because the 
variance cannot be accurately estimated (Anders and Huber 2010), 
we tested differential expression analysis using three separate bi- 
ological replicates, each derived from 20 different embryos per 
mutant and phenotypically wild-type pool. This analysis showed 
that expression of cds2 RNA containing the nonsense mutation 
is greatly reduced in the mutant pool, likely due to nonsense- 
mediated decay, providing further evidence for its role in the ob- 
served phenotype. Based on this result and the optimization studies 
described above (Fig. 2), in the limited cases in which the differential 
expression analysis component of MMAPPR is required, we rec- 
ommend using at least three biological replicate RNA-seq libraries, 
as is common for differential expression analysis by RNA-seq, from 
pools each of 10-20 wild-type and mutant embryos. Including at 
least 10 individuals in each replicate will yield 30 individuals for 
mapping by combining the data sets, which provides a number of 
individuals above the minimum required (Fig. 2). 



We first confirmed the putative nonsense mutation in cds2 
by Sanger sequencing mutants and their siblings to show that 
it segregated as expected in the population (Fig. 6C,D). Next, we 
successfully rescued the mutant phenotype by injecting wild- type 
cds2 RNA injection into one-cell-stage embryos (Fig. 6E-G). 
Mutant cds2 mRNA, with the single base change, was unable to 
rescue the phenotype (Fig. 6H). After we conducted our confir- 
mation experiments, another group published a paper showing 
that a different mutation in the same gene gave a similar phe- 
notype (Pan et al. 2012). Together, these experiments indicate 
that MMAPPR correctly identified cds2 as the causative mutant in 
the zyl4 line. 

Discussion 

The MMAPPR method described here is able to identify candidate 
mutations without any parental strain or genotype information, 
without previously identified SNP map databases, and without 
data from separate individuals. By using only single RNA-seq li- 
braries from a small number of pooled mutant individuals and 
their phenotypically wild-type siblings, MMAPPR requires fewer 
animals than is normally required for traditional mapping and less 
sequencing data than is required for whole-genome sequence 
mapping. In addition, unlike whole-genome sequencing, the same 
RNA-seq data sets can also be used for transcriptome analyses of 
mutants. Furthermore, MMAPPR will allow identification of 
X-linked mutations because mutant allele frequency in the mutant 
pool will be 1, while the mutant allele frequency in the wild- type 
pool will be 0.2, creating a Euclidean distance between populations 
that is actually greater than the autosomal recessive case. A linked 
region can also be identified for dominant mutations if they are 
fully penetrant using the Euclidean distance equation here, al- 
though modifications to the MMAPPR program would have to be 
made for candidate SNP identification. Together, these attributes 
make MMAPPR an efficient and cost-effective means to identify 
spontaneous or induced mutations. 

Using RNA-seq also allows candidate genes within the linked 
region to be identified by three different bioinformatics approaches. 
First, SNPs can be analyzed by their effect on the protein. We have 
integrated this analysis into MMAPPR because it directly uses the 
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data generated for the linkage analysis. Second, gene expression 
analysis can be done using a number of available tools (USeq, 
Cufflinks, Bioconductor) to identify putative mutations that re- 
duce mRNA levels, due either to mutations in the coding regions 
that lead to nonsense-mediated decay or to mutations in gene 
regulatory regions. Finally differential splicing can be analyzed 
using several tools (USeq, SpliceGrapher, SpliceSeq, KISSPLICE). By 
finding the intersection between the lists generated by these tools 
and the MMAPPR-identified region, one can generate a robust list 
of candidate mutations underlying the phenotype in question. 
The use of these lists will differ on a case-by-case basis. 

MMAPPR can be used for a wide variety of model and non- 
model organisms. For any organism, the criteria are a moderately 
well-assembled genome, a sufficient level of sequence poly- 



morphism (typical of most model organ- 
isms that are not highly inbred), and 
a sufficient number (—20) of F2 offspring 
that can be pooled by phenotype for RNA- 
seq. These F2 offspring do not have to be 
siblings, but can be generated from crosses 
of multiple Fl carrier siblings, as we did 
with the nkx2.5 and tbxl lines. In organ- 
isms that have a genome assembly but not 
a strong transcriptome annotation, tools 
are available to build a transcriptome from 
RNA-seq (Grabherr et al. 2011). 

Although MMAPPR worked well for 
the four examples shown here, it has 
several limitations. As currently imple- 
mented, it is unable to directly identify 
indels that are small enough not to affect 
overall gene expression levels. MMAPPR 
is capable of mapping larger deletions 
(data not shown). It is unable to identify 
genes that are missing from the reference 
build or are incorrectly annotated. Fi- 
nally, it is unable to directly identify the 
causative lesion if the pooled samples are 
collected after the gene is no longer 
expressed or the mutation lies in untran- 
scribed genomic regions. Nonetheless, it is 
important to note that in each of these 
cases, MMAPPR will identify the genomic 
region containing the lesion, and in some 
cases the affected gene can be identified 
using differential gene expression or 
splicing analysis, as described above. Any 
RNA-seq-based mapping method also 
cannot be used in cases in which the 
ability to isolate RNA for library construc- 
tion is destroyed by tissue fixation or other 
processes necessary to identify the mutant 
phenotype. 

We suggest several experimental de- 
sign decisions to help increase one's odds 
of success. First, RNA should be extracted 
as soon as possible after onset of the 
phenotype to increase the likelihood that 
the causative gene is expressed and cap- 
tured in the RNA-seq libraries. Second, we 
found that MMAPPR works with RNA 
isolated from whole animals or from tis- 
sue. Of note, the peaks for tbxl and zyl4, which both fall near each 
other on chromosome 5, were similar in size, even though Tbxl 
was identified from isolated tissue RNA-seq and zyl 4 was identified 
from whole-embryo RNA-seq. However, we recommend when 
feasible that RNA isolated from the cells or tissues of interest be 
used for RNA-seq. In this case, MMAPPR will provide a smaller 
candidate pool because it only identifies candidates in the ge- 
nomic region that are expressed at the right time in the right 
tissue. Here, the nkx2.5 and tbxl data sets were derived from 
hearts dissected from embryonic zebrafish. This tissue still pro- 
vided a sufficient number of expressed SNPs for accurate map- 
ping and, concurrently, a very small list of candidates within the 
mapped region (Table 1). Finally, increasing the read depth will 
increase the likelihood of detecting causative mutations in low- 
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expressed genes. It is commonly recommended to use at least 
three replicates for accurate differential gene analysis. Based on 
our results, three biological replicates containing at least 10-20 
individuals each should be sufficient for this analysis. These 
replicates can be barcoded and multiplexed to reduce sequencing 
costs. 

In conclusion, MMAPPR is a robust and effective way to map 
mutations generated from forward genetic screens or that spon- 
taneously arise in a population. Adoption of this method might 
remove many of the barriers for researchers hesitant to conduct 
large-scale mutagenesis screens due to the labor-intensive and te- 
dious mapping process. 



Methods 

Animal care and mating 

All fish were kept in the University of Utah Centralized Zebrafish 
Animal Resource facility or in the Yost Laboratory Zebrafish facility 
according to IACUC-approved protocols. For the nkx2.5 and 
tbxl tm208 line, fish were received from the Yelon laboratory and the 
Trede laboratory, respectively, and mated into the cmlc2:GFP line 
maintained on an AB background. Offspring carrying both the 
cmlc2:GFP and the appropriate mutation were selected and mated. 
The zyl 3 and zyl 4 lines were identified from a standard F2 muta- 
genesis screen carried out in the AB strain. Mutant lines were 
subsequently maintained by outcrossing to the Wik zebrafish line 
(zyl3, zyl4) and the Tg(flil-EGFP) and Tg(kdrl-EGFP) lines {zyl4) 
(Lawson and Weinstein 2002; Beis et al. 2005). 

RNA collection and sequencing 

Offspring from zebrafish matings were raised to 30 hours post- 
fertilization (hpf) (zyl3 and zyl4), 48 hpf (nkx2.5), or 72 hpf 
(tbxl tm20S ) under standard conditions. These time points represent 
the earliest stage at which we could confidently identify the phe- 
notypes. Embryos were then segregated into mutant and pheno- 
typically wild-type groups based on morphological phenotype 
as follows: Nkx2.5 — enlarged atrium and diminished ventricle; 
Tbxl — loss of heart looping; zyl3 — pericardial edema; and 
zyl 4 — loss of intersegmental vessels. Pools of 20 whole embryos 
were collected for the zyl 3 and zyl 4 lines. For the other two lines, 
—500 hearts were isolated as previously described (Geoffrey Burns 
and MacRae 2006) and placed in TRIzol. Both whole embryos and 
isolated hearts were processed using TRIzol extraction followed by 
the QIAGEN RNeasy Mini kit (QIAGEN). Isolated RNA was run on 
a Bioanalyzer 2100 Pico Chip (Agilent) to confirm RNA quantity 
and quality and then used to generate cDNA libraries as previously 
published (Christodoulou et al. 2011) at the Harvard Biopolymers 
Facility or using the Illumina Truseq kit (Illumina) at the University 
of Utah Microarray and Genomic Analysis Shared Resource. Sam- 
ples were barcoded as wild-type (WT) versus mutant pairs (two 
barcodes per lane), and single-end 50-bp reads were generated on 
a Hiseq 2000 machine at the University of Utah Microarray and 
Genomic Analysis Shared Resource followed by processing using 
the Cassava 1.6 pipeline. Mapping was done using Novoalign 
(Novocraft) with default parameters except output was set to Sam 
format and FASTQ scoring to ILMFQ. Mapping was done using the 
Zv9 zebrafish build with splice junctions derived from the UCSC 
Refseq refflat gene table. Data on the number of reads obtained 
from the RNA-seq data sets are summarized in Table 1. 

Data processing 

A software implementation of MMAPPR was created using a 
combination of Python 3 and R and is available at http://yost. 
genetics.utah.edu/software.php. Briefly, the software package per- 
forms the following steps: First, Bam files are passed through the 
mpileup tool in the SAMtools package (Li et al. 2009) to create 
a pileup file. The pileup format converts the file to a position-based 
format showing the bases sequenced at each position. Reads at each 
position are filtered by the minimum base quality and minimum 
mapping quality set by the user, and then the frequency of each 
allele is calculated. These data are subsequently passed to R for signal 
processing and peak identification. First, the Euclidean distance is 
calculated at each SNP location using the equation: 
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where the letters (A, C, G, T) represent their corresponding bases. 
This distance is then raised to a power set by the user. Next, the 
data are fit using a Loess curve with a polynomial exponent of 
1 and a span parameter determined by minimizing the AICc. Peak 
regions are defined as regions where the Loess fitted values are 
greater than three standard deviations above the genome-wide 
median. R then plots the Loess fits and returns a list of SNPs within 
the identified region(s) that are enriched in the mutant pool (an 
allele frequency >0.75 and a Euclidean distance >0.5). The identi- 
fied SNPs are then passed to the Alleler program (part of the USeq 
package) to filter for nonsynonymous SNPs using the provided 
gene annotation. The position and effect of these SNPs are finally 
exported to an output file. MMAPPR uses the optimized values 
reported here as defaults but allows the following variables to be 
modified by the user: mapping quality (default = 30), base quality 
(default = 20), minimum read depth (default =10), power that ED is 
raised to (default = 4), and whether repetitive regions are masked 
(default = not masked). 

Causative allele confirmation 

ctr9 was chosen from the list of zyl3 candidates based on its po- 
sition relative to the peak of the mapped region and its known 
function and expression pattern. Because it was identified as 
a mutation that might lead to nonsense-mediated decay, it was first 
sequenced to identify a G-A mutation resulting in a nonsense 
mutation at amino acid 580. The functional significance of the 
mutation was confirmed by phenotypic rescue using wild-type 
ctr9 RNA (MJ Jurynec, X Bai, A Nechiporuk, B Bisgrove, RA Somer, 
H Wilson, H Grunwald, Y-C Su, K Hoshijima, HJ Yost, et al., 
unpubl.). cds2 was chosen as the primary candidate for the zyl4 
line based on both SNP and expression analysis (see Results). The 
mutation was confirmed by Sanger sequencing of a 327-bp geno- 
mic DNA fragment amplified using primers flanking the mutation 
(Fwd: TGCAGACTTCTTTGCAAGTAAAC and Rev: TTTGGACAC 
CCCTGCTTTAT). For RNA rescue confirmation, full-length wild- 
type and mutant cDNAs were amplified by PCR (Fwd: CCAG 
GCCTCTATTTTCACCA and Rev: CCTGGTGGTCCCAGAAGT 
TA) and inserted into pCS2+. Capped RNAs were synthesized with 
the mMessage Machine SP6 transcription kit (Ambion). Embryos 
derived from matings of zyl4 +/ ~;Tg (kdrl-EGFP) +/ ~ double hetero- 
zygous parents were injected at the one to two cell stage with 50 pg 
of wild-type or mutant cds2 RNA and scored under epifluorescent 
illumination at 40 hpf for rescue of the mutant intersegmental 
vessel phenotype. Because rescued embryos were phenotypically 
indistinguishable from their wild-type siblings, mutant genotypes 
were confirmed by Sanger sequencing. 

Differential gene expression analysis 

Gene expression analysis was performed using the Useq 7.8.1 
software package (Nix et al. 2008). A single RNA-seq replicate was 
used for the nkx2.5, tbxl*™ 208 , and zyl3 lines. For the zyl4 line, 
both three biological replicates and one replicate were run for 
comparison. Differentially expressed genes were defined as genes 
with a false discovery rate <0.05. 

Genetic distance estimation 

Genetic distances between peak maxima and causative mutations 
were estimated using the SNP/str combined genetic map and the 
Zv9 build (Bradley et al. 2011). Two genetic markers surrounding 
the region and their corresponding Zv9 location were first se- 
lected and used to create a linear model of genetic versus physical 
distance between the markers. This model was then used to esti- 



mate the genetic positions of each lesion and the corresponding 
peak maximum to calculate the genetic distance between them. 
Centromere positions were determined using data from previously 
published genetic maps (Shimoda et al. 1999; Mohideen et al. 2000). 

Data access 

Raw FASTQ files and aligned BAM files are available at the NHLBI 
Bench-to-Bassinet Consortium data-sharing hub (https://b2b.hci. 
utah.edu:443/gnomex/gnomexFlex.jsp?topicNumber=27). A soft- 
ware package for MMAPPR analysis and instructions for its use 
can be accessed from http://yost.genetics.utah.edu/software.php. 
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